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STABILITY OF THE GIBBS SAMPLER FOR BAYESIAN 
HIERARCHICAL MODELS 

By Omiros Papaspiliopoulos* and Gareth Roberts 

University of Warwick and Lancaster University 

, We characterise the convergence of the Gibbs sampler which sam- 

' pies from the joint posterior distribution of parameters and missing 

■ data in hierarchical linear models with arbitrary symmetric error dis- 



tributions. We show that the convergence can be uniform, geometric 
£Nj . or sub-geometric depending on the relative tail behaviour of the error 

distributions, and on the parametrisation chosen. Our theory is ap- 

■ plied to characterise the convergence of the Gibbs sampler on latent 

Gaussian process models. We indicate how the theoretical framework 
we introduce will be useful in analyzing more complex models. 

1. Introduction. Hierarchical modelling is a widely adopted approach 
£>. ■ to constructing complex statistical models. The appeal of the method lies in 

■ the simplicity in specifying a highly multivariate model by joining many sim- 
^ pie and tractable models, the foundational justification based on the ideas of 

\ partial exchangeability, the flexibility to extend or simplify the model in the 

light of new information, and the ease of inference using powerful Markov 
<^ • chain Monte Carlo (MCMC) methods which have been developed to this 

end during the last two decades. Thus, hierarchical models have been used 
^ ' in many areas of applied statistics such as geostatistics [S] , longitudinal anal- 

ysis [9], disease mapping [3], and financial econometrics [23] to name just a 
few. 
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A rather general form of a two-level hierarchical model is 



(1) 



X 



Y 



C{Y\X) 
C(X\@) 



where C(X) and C(Y \ X) denote the distribution of X and the conditional 
distribution of Y given X respectively. We will refer to Y as the data, X as 
the missing data and as the parameters. In a Bayesian context the model is 
completed by specifying a prior distribution for 0. Typically the dimension 
of X is much larger than that of and it can increase with the size of the 
data set. Most of the applications cited above fit into (1) by imposing the 
appropriate structure on C(Y \ X) and C(X | 0). It is straightforward to 
construct models with more levels. 

Bayesian inference for (1) involves the posterior distribution C(X, | 
Y = y). This is typically analytically intractable, but it can be sampled 
relatively easily using the Gibbs sampler [29] , by simulating iteratively from 
the two conditional distributions C(X \ @,Y = y), and £(0 | X, Y = 
y). It has been demonstrated both theoretically and empirically that the 
convergence (to be formally defined in Section 3) of the Gibbs sampler relates 
to the structure of the hierarchical model and particularly to the dependence 
between the updated components, X and 0. Nevertheless, the exact way in 
which the model structure interferes with the convergence remains largely 
unresolved. Concrete theoretical results exist only for Gaussian hierarchical 
models, but we will see that these results do not extend to more general cases. 
Although interesting characterizations of the convergence rate in terms of the 
dependence between X and exist when the Gibbs sampler is geometrically 
ergodic [1], there exist no general results which establish geometric ergodicity 
for the Gibbs sampler. The difficulty in obtaining such general results lies 
in the intrinsic dependence of the convergence of the Gibbs sampler on the 
model structure. 

In this paper we show explicitly how the relative tail behaviour of C(Y \ 
X) and C(X | 0) determines the stability of the Gibbs sampler, i.e. whether 
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the convergence is uniform, geometric or sub-geometric. Moreover, we show 
that the relative tail behaviour dictates the type of parametrisation that 
should be adopted. In order to retain tractability and formulate interpretable 
and easy to check conditions we restrict attention to the class of linear hier- 
archical models with general error distributions; the precise model structure 
is given in Section 2.1. Nevertheless, our main theoretical results, in par- 
ticular Theorems 3.3, 3.4, 3.5 and 6.3, and the methodology for proving 
them are expected to be useful in a much more general context than the one 
considered here. 

Consideration of the class of linear non-Gaussian hierarchical models is 
not merely motivated by mathematical convenience. These models are very 
useful in real applications, for example in longitudinal random effects mod- 
elling [9, 13], time series analysis [4, 12, 28] and spatial modelling [8]. They 
also are a fundamental tool in the robust Bayesian analysis [7, 20, 22, 30]. 
Furthermore, we will see that the stability of the Gibbs sampler for linear 
non-Gaussian models is very different compared to the Gaussian case, the 
local dependence between X and being crucial in the non-Gaussian case. 
Notice that several other models can be approximately written as linear non- 
Gaussian models. Actually, this work has been motivated by the behaviour 
of MCMC for non-Gaussian Ornstein-Uhlnebeck stochastic volatility models 
[23]. 

The paper is organised as follows. Section 2.1 specifies the models we 
will be concerned with and it establishes some basic notation. Section 2.2 
discusses Gibbs sampling under different parametrisations of the model and 
Section 2.3 motivates the theory and the methodology developed in this pa- 
per by a simple example. Section 3 is the theoretical core of this paper; the 
section commences with a short review of stability concepts for the Gibbs 
sampler; Section 3.1 recalls the existing results for Gaussian linear models; 
Section 3.2 develops stability theory for hierarchical models and states three 
main theorems for the stability of the Gibbs sampler; based on these theo- 
rems Section 3.3 provides the characterization of the stability of the Gibbs 
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sampler under different parametrisations for a broad class of linear hierarchi- 
cal models; Section 3.4 considers an alternative augmentation scheme when 
one of the error distributions is a scale mixture of normals and compares the 
convergence of a three-component Gibbs sampler with that of its collapsed 
two-component counterpart. Section 4 extends the theory to hierarchical 
models which involve latent Gaussian processes. Section 5 discusses exten- 
sions and contains some practical guidelines. Section 6 contains the proofs 
of all theorems and propositions. The proofs are based on establishing ge- 
ometric drift conditions and minorization conditions and using capacitance 
arguments in conjunction with Cheeger's inequality. 

2. Models, parametrisations and motivation. 

2.1. Linear hierarchical models. The models we consider in this paper 
are of the following form, where Yj is rrii x 1, Cj is rrii x p, Xj is p X 1, D 
is p x 1 and G is a scalar: 

Yj = CjXj + Zii , i = l,...,m 
(2) X, = DQ + Z 2l . 

Zij, i = 1, . . . ,m, are iid with distribution £(Zi), Zi2i, i = 1, • • • , m, are iid 
with distribution C(Z 2 ), and £(Zi) and C(Z 2 ) are symmetric distributions 
around (a vector of Os with the appropriate dimension). In the sequel, 
bold-face letters will correspond to vectors and matrices, capital letters to 
random variables and lower-case letters to their realisations. In this setting 
Y = (Yi, . . . , Y m ) and X = (Xi, . . . , X m ). The first equation in (2) will be 
termed the observation equation and the second the hidden equation. 

It is often conveniently assumed that both C(Z±) and C(Z 2 ) are Gaus- 
sian. However there are several applications where this assumption is clearly 
inappropriate, especially if we wish to make the inference about X robust 
in the presence of prior-data conflict. It is known [see e.g. 20, 22, 30, and 
references therein] that if the tails of C(Z±) are heavier than the tails of 
C(Z 2 ) then inference for X is robust to outlying observations, whereas if 
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C(Zi2) has heavier tails than £(Zi) inference for X is less influenced by the 
prior in case of data-prior conflict; these robustness is absent from Gaussian 
models. This type of robust modelling has been undertaken in time-series 
analysis, see for example [12]. 

2.2. Gibbs sampling and parametrisations. As is common in this frame- 
work, we place an improper flat prior on 0, which in this context leads to a 
proper posterior. Bayesian inference for (2) involves the joint posterior dis- 
tribution £(X, | Y = y), which will abbreviate to £(X, O | Y). Although 
it is often analytically intractable, it can be sampled easily using the Gibbs 
sampler. 

The parametrisation Vq := (X,0) is termed the centred parametrisation. 
This terminology was first used in the linear Gaussian context by [10]. Fol- 
lowing [21] we shall use the term more generally to refer to a parametri- 
sation where the parameters and the data are conditionally independent 
given the missing data. We can use the Gibbs sampler to collect samples 
from £(U, | Y) where U = /i(X,0), for some invertible transformation 
h, and then transform the draws to obtain samples from £(X,0 | Y). In 
the rest of the paper we will use V to refer to a general parametrisation 
(U,0). It is known [16] that the convergence (to be formally introduced 
in Section 3) of the Gibbs sampler improves as the dependence between 
the updated components, U and 0, decreases. Hence, the development of 
general re-parametrisation strategies has been actively researched, see [21] 
for a recent account. In that work, the authors introduce the non- centred 
reparametrisation V\ := (X, 0), which replaces X with X := /i(X, 0), where 
h is a transformation which makes and X apriori independent. In the con- 
text of linear hierarchical models X = (Xi, . . . , X m ), where Xj = /i(Xj, 0), 
and /i(x, 9) := x — D#. We will see that Vq and V\ present two natural 
choices. 

The prolific expansion in the use of Gibbs sampling for inference in hierar- 
chical models during the 1990s was fuelled by the apparent rapid convergence 
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of the algorithm in many cases. However, to date, there has been little theo- 
retical analysis linking the stability of the Gibbs sampler to the structure of 
hierarchical models. A notable exception are the explicit convergence results 
for Gaussian linear hierarchical models obtained in [24] and summarised in 
Section 3.1. The following example is revealing as to what might go wrong 
when considering non-Gaussian linear models, and motivates the methodol- 
ogy and theory developed in this article. 

2.3. A motivating example. Consider a simplified version of (1) where 
m = mi = C\ = D = 1, 

Y = X + Z l 

(3) x = e + z 2 . 

Assume that C{Z\) = Ca(0, 1), a standard Cauchy distribution, C(Zz) = 
N(0, 5), and y = is observed. Figure 2.3a shows the sampled values of G 
after two independent runs of the Gibbs sampler, each of 10 4 iterations. The 
top one is started from the mode, Oo = 0, and superficially it appears to 
be mixing well: the autocorrelation in the series becomes negligible after 10 
lags, and most convergence diagnostic tests would assess that the chain has 
converged. Nevertheless, the chain never exits the set (—40,40), although 
this is an event with stationary probability about 0.015. The second run, 
Figure 2.3a bottom, is started from Bo = 200, and the chain spends more 
than 4,000 iterations wondering around ©o- The contour plot of the joint 
posterior log-density of X and in Figure 2.3b, provides an explanation: 
the contours look roughly spherical near the mode, but they become asymp- 
totically concentrated around x = as \0\ — > oo. Thus, restricted to an area 
around the mode, X and look roughly independent, but in the tails they 
are highly dependent. In fact, C(X - 9 \ Y,@ = 9) -> N(0,5) as \9\ -> oo, 
and we show in Section 3.3 that the Gibbs sampler which updates X and 
converges sub-geometrically. In contrast, C(X \ Y, = 9) — > C(X), as 
\9\ — » oo, and as we show in Section 3.3 the Gibbs sampler which updates 
X and is uniformly ergo die. 
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3. Convergence of the Gibbs sampler for linear hierarchical mod- 
els. Given the parametrisation V = (U, 0), the two-component Gibbs 
sampler simulates iteratively from £(U|Y, = n _i), and £(0|Y,U = 
U n ), where ©o is a starting value and n > 1 denotes the iteration number. 
This algorithm generates a Markov chain {(U n , n )} with stationary distri- 
bution £(U, | Y). The marginal chain {0 n } is also Markov and reversible 
with respect to £(0 | Y) (Lemma 3.1. of [16]). Moreover, it can be shown 
[26] that the convergence rate of the joint chain coincides with the conver- 
gence rate of the marginal chain, {0 n }. Notice that this result does not hold 
for Gibbs samplers which update more than two components. In the sequel, 
for any random variables W and V, and probability law fj,, we will use the 
short-hand notation, 

C(V | W ~ //) := J £{V | W = w)n(dw). 
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We will consider the convergence of {O n } through the total variation 
norm, defined as 

||£h(0 n | Y, G ) - £(0|Y)|| = sup \E h {g(Q n ) | Y, O } - E{#(0)|Y}|. 

Isl<i 

Ch(Qn | Y, ©o) is the distribution of the chain after n steps started from 
@o, and Eh{g(Q n ) | Y, ©o} is the expected value of a real bounded function 
g with respect to this distribution. Ch(@ n I Y, ©o) clearly depends on the 
parametrisation U = /i(X,0), since, 

C h (@i | Y,0 O ) = C{ | Y,U~£(U| Y,0 = O )}. 

Under standard regularity conditions (Theorem 13.0.1 of [19]) the total vari- 
ation norm converges to as n — > oo. We say that {0 n } is geometrically 
ergodic when there exist an r < 1 and some function M(-), such that 

(4) \\£ h (O n I Y,0o) - A©|Y)|| < M(9 )r n . 

The smallest r for which (4) holds, say r^, is known as the rate of conver- 
gence of {0 n }- However, the actual distance from stationarity will in general 
depend on the starting point and this is represented by the term M(©o) in 
(4). When M(-) is bounded above, {0 n } is called uniformly ergodic. Uni- 
form ergodicity is a valuable property, since it ensures that the convergence 
of the chain does not depend critically on the initial value chosen. Whilst 
this does not guarantee rapid convergence, it ensures that the "burn-in" 
problem cannot become arbitrarily bad from certain starting points. 

Geometric ergodicity is a qualitative stability property, and geometrically 
ergodic algorithms may still converge slowly and give Monte Carlo estimates 
with high variance (for example when r^ ~ 1). However, algorithms which 
fail to be geometrically ergodic can lead to various undesirable properties, 
including the break down of the central limit theorem for ergodic average es- 
timates. In this case the simulation can be unreliable and the drawn samples 
might poorly represent the target distribution. 
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To keep nomenclature simple we will identify a parametrisation V = 
(U, G) with the Gibbs sampler which updates U and G. Thus, we say that 
a parametrisation V is geometrically (respectively uniformly) ergodic, if the 
Gibbs sampler implemented using this parametrisation is geometrically (re- 
spectively uniformly) ergodic. 

3.1. Gaussian models. The Gibbs sampler for the Gaussian linear model 
is geometrically ergodic with rate given in [24]. In the simplified model (3) 
assume that C(Zi) = N(0, af),i = 1, 2, and define k = 02/(02 + Then, 
[21] building on the results of [24] showed that, when U = h(X, G) = X—pQ, 



this setting, the dependence between U and Q is appropriately quantified 
by the correlation coefficient, and (5) shows that the larger the correlation 
the worse the convergence. Many refinements and generalizations of these 
results can be found in [24], [21] and [17]. Notice that both Vq and V\ are 
geometrically ergodic. Vo converges rapidly when the observation equation is 
"more precise" than the hidden equation, that is o\ « o~2, and it converges 
slowly when the hidden equation is relatively precise. V\ converges rapidly 
when the hidden equation is relatively more precise. 

3.2. General theory for linear hierarchical models. This section gives 
general results which can be used to characterise the stability of the Gibbs 
sampler on linear hierarchical models of the form (2) where the A^s are 
univariate and D = 1. Our results are valid when m > 1 and mi > 1 (see 
Remark 1 in page 13), however in order to keep the notation simple we will 
work with the simplified model (3), where all Y, X and Q are scalars. C{Z\) 
and C{Z2) are arbitrary symmetric distributions with continuous bounded 
everywhere positive densities, f\ and /2 respectively; common examples in- 
clude the Gaussian, the Cauchy and the double exponential. This section 
gives the general results, while Section 3.3 applies them to characterise the 
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convergence of the Gibbs sampler for (a broad class of) linear non-Gaussian 
hierarchical models. Section 4 deals with extensions where the AjS are vec- 
tors of dependent variables, therefore covering state-space and spatial mod- 
els. Nevertheless, the results even for the more structured models follow 
relatively easily from the results of this section. All proofs are deferred to 
Section 6. 

We begin by introducing a collection of posterior robustness concepts, 
which are related with the behaviour of the conditional posterior distri- 
bution C(U | Y, = 9) as \9\ — > oo. All these concepts have statistical 
interpretations but they turn out to provide the required mathematical con- 
ditions for characterising the stability of the Gibbs sampler, as we show in 
Theorems 3.3, 3.4 and 3.5 below. 

Definition 3.1. The parametrisation V = (U,Q) is called: 

1. partially tight in parameter (PTIP), if for all y, there is some k > 
such that, 



2. geometrically tight in parameter (GTIP), if there exist positive con- 
stants, a, b (independent of 9) such that for all 9, 



GTIP not only implies that C(U \ Y, Q = 9) is a tight family of distri- 
butions, but also that the tail probabilities are bounded exponentially. (We 
recall that a family of distributions on the real line, say Fg, indexed by a 
scalar 9, is called tight when lim^oo sup^ Fg([— k, k] c ) = 0.) Clearly, GTIP 
is much stronger condition than PTIP. We consider also the following model 
robustness concepts. 

Definition 3.2. We say that the linear hierarchical model (3) is 



(6) 



limsupP(|[/| > k\Y = y,G = 9) < 1, 

|0|->oc 



P(|I7| > x\Y = y,0 = 9) < ae 



—bx 
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robust in parameter (RIP), if 



0|->oo 



lim C(X\Y = y,e = 6) = C{Z 1 + y) 



2. 



robust in data (RID), if 



|0|-KX) 



lim C(X\Y = y,e = 6) = C(X) 



3. data uniformly relevant (DUR), if there exist positive constants d, k 
such that for all \9\ > k, 



4- parameter uniformly relevant (PUR), if there exist positive constants 
d, k such that for all \9\ > k, 



These definitions characterise the hierarchical model according to how 
inference for X (conditionally on O = 9) is affected by a large discrepancy 
between the data y and the prior guess 9. When the model is RIP inference 
for X ignores 9, and it is symmetric around y. Conversely, when the model 
is RID inference for X ignores the data and becomes symmetric around 9. 
When the model is DUR (PUR) the data (the parameter) always influences 
the conditional expectation of X. Notice that when the model is RIP Vq is 
PTIP (although not necessarily GTIP), and when it is RID V\ is PTIP. The 
example in Section 2.3 describes a RID model. A model can be both DUR 
and PUR (for example the Gaussian linear model). 

Theorem 3.3. Consider the linear hierarchical model (3) where the er- 
ror densities f\ and fi are continuous, bounded and everywhere positive. If 
Vq (Pi) is PTIP, then it is uniformly ergodic. 

Theorem 3.4. Consider the linear hierarchical model (3) where the er- 
ror densities f\ and fi are continuous, bounded and everywhere positive. If 
the model is RID then Vo is not geometrically ergodic, and if the model is 
RIP then V\ is not geometrically ergodic. 



E{X|y = y,e = 9}\ < \9\ - d, 



sgn(9)E{X -y\Y = y,Q = 9}> d. 
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Distribution 


Code 


Density g(x) up to proportionality 


Cauchy 


C 


0-7(1+3^) 


Double exponential 


E 


exp {— x|/cr} 


Gaussian 


G 


expj-(x»72} 


Exponential power distribution 


L 


exp{~\x/a\' 3 \ , /3 > 2 



Table 1 

Distributions for the error terms and their densities. In the paper they are coded 
according to the letter in the middle column. 



The proof Theorem 3.4 is based on the general Theorem 6.3 about Markov 
chains on the real line, which is stated and proved in Section 6. 

Theorem 3.5. 1. If the model is DUR, V\ is GTIP, and C{Z 2 ) has 
finite moment generating function in a neighbourhood of 0, then Vq is geo- 
metrically ergodic. 2. If the model is PUR, Vq is GTIP, and C(Z\) has finite 
moment generating function in a neighbourhood of 0, then V\ is geometri- 
cally ergodic. 

The theorems are proved by establishing a geometric drift condition. The 
requirements of GTIP for V\ (Vq) and finite moment generating function 
for C{Z2) {C{Zi)) are in order to tilt exponentially the linear drift condition 
provided by DUR (PUR). 

3.3. Characterising the stability of the Gibbs sampler according to the dis- 
tribution tails of the error terms. In this section, building upon the general 
theory of Section 3.2, we characterise the stability of the Gibbs sampler on 
the linear hierarchical model (3) for different specifications of C(Zi), L{Z<i). 
Although we consider the error distributions in Table 1, our proofs remain 
valid for much broader families of distributions (see Remark 2 on page 13). 
Notice that the exponential power distribution contains both the Gaussian 
{(5 = 2) and the double exponential {(3 = 1) as special cases. Here we con- 
sider densities with tails lighter than Gaussian (/? > 2). For the use of this 
distribution in Bayesian robustness see [5]. 

We shall specify linear models giving first C{Z\) and then £(^2), for 
instance the (C, E) model corresponds to (3) with Cauchy distribution for 
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Stability of V 








C 


E 


G 


L 




C 


u 


U 


U 


U 




E 


N 


G/U 


u 


U 




G 


N 


G 


G 


G 




L 


N 


G 


G 


G 



Stability of Vi 


C(Z 1 ) 






C 


E 


G 


L 




c 


U 


N 


N 


N 


C(Z 2 ) 


E 


u 


U/G 


G 


G 




G 


u 


U 


G 


G 




L 


u 


U 


G 


G 



Table 2 

Stability Vo (left) and Vi (right) for the linear hierarchical model (3) for specifications of 
the distribution of the error terms as in Table 1. 



Z\, and double exponential distribution for Z 2 . For each model we have two 
parametrisations, thus two algorithms, Vq and V\. When we refer to the 
stability of an algorithm we shall write U, G, and N to refer to uniform, 
geometric and non-geometric (i.e. sub-geometric) ergodicity, respectively. 

Theorem 3.6. The stability Vq and V\ is given in Table 2. 



Remark 1. The determining factor in classifying the stability of a parametri- 
sation is the tail behaviour of C{Z\) and L(Z-}). Thus, Theorem 3.6 gener- 
alises to the case of multiple random effects and observations: 

Yij = Xi + Z U j, j = 1, • • • , mi 
Xi = + Z 2 i, i = l,...,m 

where Z±.. and Z%. are independently distributed identically to C(Z\) and 
d{Z-i) respectively. This extension is immediate where obvious sufficient 
statistics exist (the C and N cases). However, since proving formally the 
full generalisation would be extremely tedious (although in the same lines 
as in Section 6), we do not attempt it here. 

Remark 2. The same results can be obtained when any of the distributions 
considered in Table 2 is replaced by another symmetric distribution with 
the same tail behaviour, which possess a bounded continuous everywhere 
positive density. 
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Remark 3. Different results hold when a proper prior for B is imposed. In 
this case the convergence improves. 

Remark 4- The results of Theorem 3.6 are independent of the actual value 
of y. This does not necessarily hold in other contexts. 

Remark 5. In the (E, E) model, the stability depends on the ratio of the 
scale parameters in C{Z\) and C{Z-i). Depending on this ratio convergence 
can be either geometric or uniform (see Section 6 for details). 

Remark 6. The following heuristic can be derived from Table 2: convergence 
of Vq is best when C{Z\) has lighter tails than C(Z2), and worst when it has 
heavier tails. The situation for V\ is the reverse. Both algorithms become 
more stable the lighter the tails of C{Z\) and d{Z<i) become. 

3.4. Convergence of the grouped Gibbs sampler. An alternative augmen- 
tation scheme and sampling algorithm can be adopted when one of the 
error distributions, say C{Z2) for convenience, is Gaussian and the other, 
say C(Zi), is a scale mixture of Gaussian distributions. Several symmet- 
ric distributions belong in this class, for instance the Student-t (thus the 
Cauchy) and the double exponential [2]. In this case, Z\ can be represented 
as Z\ = V/Q, where V has a standard Gaussian distribution and Q is 
positive and independent of V. We can treat Q as missing data and con- 
struct a three- component Gibbs sampler which updates iteratively X, Q 
and from their conditional distributions. (When X = (X±, . . . ,X m ) then 
Q = (Qi, ■ ■ ■ ,Qm) where Qi is independent from Qj for every i ^ j). A 
major computational advantage of this approach is that C{X \ Y, 0, Q) is 
Gaussian and it can be easily sampled. Notice that Q and are indepen- 
dent given X, thus we can implement the Gibbs sampler using a grouped 
scheme [15] where and Q are updated in one block. It is of interest to 
know whether the convergence of this grouped Gibbs sampler is better than 
the convergence of the collapsed Gibbs sampler (as defined in [15]), where Q 
has been integrated out. The "Three-schemes Theorem" of [15] states that 
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the norm of the transition operator of the grouped Gibbs sampler is larger 
than the one which corresponds to the collapsed Gibbs sampler. This result, 
however, is not enough to guarantee that the collapsed sampler will have 
better convergence rate. 

In order to give a concrete answer, we consider the important special 
case, where C{Z{) is the Cauchy distribution, therefore Q ~ Ga(l/2, 1/2). 
We have the following proposition, whose proof is based on Theorem 6.3. 

Proposition 3.7. The grouped Gibbs sampler is not geometrically er- 
godic. 

This result remains true for a number of random effects m > 1, and it 
will hold for more general Student-t distributions. This result has important 
practical implications especially in algorithms for latent Gaussian models, 
considered in Section 4. It is also significant that it contrasts the result ob- 
tained by [27], who establishes geometric ergodicity for variance component 
models (of which the model considered here is a special case). However, the 
result in [27] is true when the number of data Y^, mj, per random effect Xj 
is larger than some number bigger than one, whereas in Lemma 3.7 we take 
mi = 1. 

4. Latent Gaussian process models. In this section we consider a 
rather specific though useful model and demonstrate that the results of 
Section 3.2 can be extended quite readily to this context giving some clear- 
cut conclusions and advice for practical implementation. The results below 
are certainly not the most general possible, but it is hoped that the method 
of proof will indicate how analogous models might be addressed. 



Theorem 4.1. Consider the latent Gaussian process model: 



Y = 



X + Z x 



X = 1G + 
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where Zi = {Zn, . . . Z\ p } is a vector of independent and identically dis- 
tributed standard Cauchy random variables, Z2 = {Z21 , ■ ■ ■ Zip\ is a vector 
of independent and identically distributed standard Gaussian random vari- 
ables, and 1 is a vector of 1 's. S is assumed known and a flat is prior is 
assigned to G. Then 1. Vq fails to be geometrically ergodic; 2. V\ is uniformly 
ergodic. 

As we remarked on page 13, the result holds when the Cauchy is gener- 
alised to a Student-t with any degrees of freedom. The MCMC for latent 
Gaussian process models is often implemented using a different augmen- 
tation scheme. As in Section 3.4, we can augment the model with Q = 
(Qi, . . . , Q p ), where C{Qi) = Ga(l/2, 1/2). However, a similar argument as 
in the proof of Proposition 3.7 shows that the Gibbs sampler which updates 
X, Q and G is not geometrically ergodic. 

As a numerical illustration we consider a linear non-Gaussian state-space 
model: X±, . . . ,X p are consecutive draws from an AR(1) model, which are 
observed with Cauchy error. We have simulated p = 100 data from this 
model using Q = 0. The update of G given X is from a Gaussian distribution, 
however the update of X given Q and Y is non-trivial. We update all the 
states together using a highly efficient Langevin algorithm, see [6] for details. 
Moreover, we perform several updates of X for every update of G so that our 
results are not critically affected by not being able to simulate directly from 
£(X I Y,G). Figure 4 depicts our theoretical findings. Vq has a random 
walk-like behaviour in the tails, whereas V\ returns rapidly to the modal 
area. On the other hand, Vq mixes better than V\ around the mode. Note 
that the instability of Vq in the tails is not due to lack of information about 
G but due to the robustness properties of the model. 

In this context it is definitely advisable to mix between Vq and V\, i.e to 
use a hybrid sampler which at every iteration with some probability updates 
(G,X) and with the remaining probability it updates (G,X). This hybrid 
sampler will inherit the uniform ergodicity from V\ but it will also mix well 



imsart-aos ver. 2007/01/24 file: papaspiliopoulos_roberts_stabilty-REVISION.tex date: February 5, 2008 



STABILITY OF GIBBS SAMPLER 



17 




Fig 2. Two runs of Vo (left) and V\ (right) with two different starting values: Oo = 
(top) and 0o = 500 (bottom). 

around the modal area. 

5. Discussion. We have obtained rigorous theoretical results for the 
stability of the Gibbs sampler which explores the posterior distribution aris- 
ing from a broad class of linear hierarchical models. We have also proved 
results regarding more complicated hierarchical models with latent Gaus- 
sian processes, and we have compared different sampling schemes. We have 
shown how the model structure dictates which parametrisation should be 
adopted for improving the convergence of the Gibbs sampler. 

Our results are certainly not the most general possible, though the method 
of proof we have used indicates clearly how analogous problems might be 
addressed. As an example of this, it is easy to extend the conclusions of 
Table 2 to the case where the light-tailed distributions are replaced by (say) 
uniform distributions on finite ranges. The robustness concepts of PTIP, 
GTIP, RIP and RID are already stated in a general form, while the con- 
cepts of DUR and PUR can be translated in a natural way using Lyapunov 

imsart-aos ver. 2007/01/24 file: papaspiliopoulos_roberts_stabilty-REVISION.tex date: February 5, 2008 



18 



PAPASPILIOPOULOS AND ROBERTS 



drift conditions. Families of models to which we are currently investigating 
extensions of our methods, include stochastic volatility models prevalent in 
finance. This is the subject of on-going research by the authors. 

The general heuristic is clear - the stability of the centred and non-centred 
algorithms, Vo and V\ respectively, depends on the relative tail behaviour of 
C{Z\) and £(^2), with the centred method being more stable when C{Z\) is 
relatively light tailed, and the non-centered being more stable when £(^2) 
is relatively light tailed. An additional conclusion of Table 2 is that, as 
expected, both algorithms possess comparatively more stable convergence 
properties the lighter the tails of C{Z\) and £(^2) become. 

The main message of the paper for the MCMC practitioner is a positive 
one: the competition between Vo and V\ works to the user's benefit. Our 
results suggest that a combination of Vq and V\ is often desirable. When 
the tails of the error distributions are very different we have found that 
one of the algorithms might be very good for visiting the tails of the target 
distribution whereas the other for exploring the modal area (as for example 
we demonstrate in Figure 4). Therefore, it is advisable to use a hybrid Gibbs 
sampler which at every iteration with some probability updates (O, X) and 
with the remaining probability it updates (@,X). Moreover, by linking the 
stability of the Gibbs sampler to the robustness properties of the hierarchical 
model we provide intuition which can be found useful for models outside the 
scope of this paper. 

Another interesting product of this work is that linear re-parametrisations, 
which can substantially improve the convergence rate in (approximately) 
Gaussian models, might be of little relevance when the tail behaviour of 
C(Zi) is very different from £(.^2). For example, in (C,G) model, where 
the observation error is Cauchy and the prior for X is Gaussian, we can 
prove that the Gibbs sampler which updates U = X — pQ and is sub- 
geometrically ergodic for all p < 1, whereas it is uniformly ergodic for p = 1 
as we already know from Theorem 3.6. This emphasizes the special role of 
V\, which differs because of the prior independence it induces on X and 0. 
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This result suggests that conditional augmentation (as in [18]) algorithms 
might fail to be geometrically ergodic when Vq does. 

All the results presented here are specific to the Gibbs sampler, however 
our findings are clearly relevant to contexts where certain direct simulation 
steps have to be replaced by appropriate Metropolis-Hastings steps (as for 
example in the simulation illustration in Section 4). 

It is worth mentioning that once we have established geometric ergodicity 
for an algorithm, it is important to obtain computable bounds on the rate 
of convergence. We have not attempted to do so, since it is outside the focus 
of this paper. For advances in this direction see for example [11, 27]. 

One interesting feature resulting from this paper is that the marginal 
chain {0 n } of the Gibbs sampler on linear non-Gaussian models often be- 
haves asymptotically (i.e in the tails) like a random auto-regression of the 
form: 

©n = Pn&n-l + 

where p n is a random variable taking values in [0,1], and e n is an error 
term. For instance in the (G, G) case of Theorem 3.6 for Vo (Vi) p n is 
deterministically equal to ro (r\) defined in Section 3.1. The cases where we 
demonstrate that the algorithm is random-walk like correspond to taking 
p n = 1 (almost surely). Furthermore in a number of cases, p n is genuinely 
random. For instance, in the (E, E) case with identical rates, p n ~ U[0, 1]. 
In the (C,C) case, we find that p n takes the value or 1 with probabilities 
determined by the scale parameters of the Cauchy distributions involved. 

An extension of our ideas is possible for hierarchical models with more 
levels. For instance consider the linear structure given by 

Y = Q 1 + Z 1 
(7) Qi = @ i+1 + Z i+1 , i = l,...d-l, 

with a flat prior on G^. Since Y is the only information available, the poste- 
rior tails of 0i, 02 • • ■ become progressively heavier. If at any stage, Zj has 
lighter tails than Zi-i, then whenever 0i_i and 0«+i strongly disagree, the 
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conditional distribution of 0j given Y, 0_j will virtually ignore 0j_i and 
hence the data. This will lead to potential instabilities in the chain in com- 
ponents Qi, Oj+i, . . . , ®d- We call this phenomenon the quicksand principle, 
and this is the subject of ongoing investigation by the authors. 

6. Proofs of main results. In the sequel we will use ir to denote the 
density of any stationary measure, in particular ir(9 \ y) and ir(x \ y, 9) 
will be the Lebesgue densities of £(0 | Y = y) and C{X \ Y = y, = 9) 
respectively. With p(-, •) we denote the transition density of a Markov chain, 
and with ©o and 0i the consecutive values of the marginal chain {© n }- 

PROOF OF Theorem 3.3. We show the result for Vq, since the corre- 
sponding result for V\ can be proved in an analogous way. In particular, we 
show that when Vo is PTIP, the transition density of the the marginal chain 
{© n }, is such that mfg p(9o,9i) > 0, and p is also continuous in 9\. This 
guarantees uniform ergodicity by Theorem 16.0.2 of [19]. 



p(0oA) = f2(\x-9 1 \)7r(x\y,9 )dx> f 2 (\x-9 1 \)n(x\y,9 )dx 



for k such that (6) holds. Since /i and f 2 are everywhere positive, bounded 
and continuous, P(|^| < k\Y = y,Q = 9q) is also positive and continuous 
in #o> therefore by the PTIP property it follows that inf# P(|A"| < k\Y = 
y, G = #o) > 0. Moreover, inf| x |<fc f 2 {\x — is positive and continuous in 



The proof of Theorem 3.4 requires Theorem 6.3, hence it is proved on 
page 24. The proof of Theorem 3.5 requires the following lemmas. 




> inf / 2 (|s-0i|) P(|X| <k\Y = y,e = 9 ) 



9\, thus the result follows. 



□ 



Lemma 6.1. 1. If (3) is DUR and the parametrisation (X, ©) is GTIP, 
then for all sufficiently small a > 0, 




where k,d are defined in Definition 3.2. 
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2. If (3) is PUR and the parametrisation (X, 0) is GTIP, then for all 
sufficiently small a > 0, 

E^v-x^Y = y,e = 0} < e ad {l-ad/2), for 6 > k 
E{e- a fo-*)|Y = y,@ = #} < e~ a \\ - ad/2), for 9 < -k, 

Proof. 1. We will prove only the first inequality, for 9 > k, since the 
other is proved in a similar fashion. We define Gg(t) = E |e* ( x ~^ \Y,Q = 0\, 
which is finite for all sufficiently small t > 0, say < t < to f° r some to, 
and for all 9, since by the GTIP assumption C(\X — 9\ \ Y,Q = 9) has expo- 
nential or lighter tails. By a second order Taylor series expansion of Gg(t) 
around t = 0, we obtain for some < t\ < to, and for 9 > k, 

t 2 

G g (t) = l + tE{X-9\Y,0 = 9} + -E{{X-9) 2 e tl{x - e) \Y,G = 9} 
< i_^ + ^ E {(X -9) 2 e^ x ^\Y,e = 9}. 

Now pick a <t\ small enough so that for all 9 > k aE {(X - efe tl{ - x -^\Y,@ = 9} < 
d. Such a exists due to the GTIP assumption. Then, Gg(a) < 1 — ad/2, and 
the result follows. 2. It is proved as 1, recognising that X = X — 9. □ 

Lemma 6.2. 1. If (3) is DUR and the parametrisation (X, 6) is GTIP, 
then for all sufficiently small a > 0, 

E{e Q l x l|Y,9 = 0} < e a|e| (l - ad/2) + K, for \9\ > k, 

where k, d are defined in Definition 3.2, and < K < 00. 
2. If (3) is PUR and the parametrisation (X, 0) is GTIP, then for all 
sufficiently small a > 0, 

E{e a \ y - X \\Y = y,Q = e\ < e aW (1 - ad/2) + K, for \9\ > k, 

where k, d are defined in Definition 3.2, and < K < 00. 
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Proof. 1. We prove the result for 9 > exploiting the first inequality 
given in Lemma 6.1. The case < is proved analogously but exploiting 
the second inequality of Lemma 6.1. Notice that 

E {e a|x| |Y, 6 = j < E {e aX \Y, = #} + J e~ ax n{x \ y, 6)dx, 

thus, due to Lemma 6.1 we only need to show that the second term of 
the sum above can be bounded above for all 0. Recall a, b from the GTIP 
Definition 3.2. Choose a < b. Using integration by parts, we find that the 
second summand is bounded above by, e~ bS [a+a/(b— a)], which can easily be 
bounded above for all 9 > k. 2. It is proved as 1, recognising that X = X — 0. 

□ 

PROOF OF Theorem 3.5 1. We prove the result establishing a geometric 
drift condition for the marginal chain {0 n }, using the function V(9) = e a ' e ', 
for appropriately chosen a > 0. Notice first that £(0 | Y, X = x) = £(0 | 
X = x) is symmetric around x and has a finite moment generating function 
in a neighbourhood of the origin. Thus, working as in Lemma 6.1 and Lemma 
6.2, we can show that for all sufficiently small a > 0, there exists K\ > 
and e > 0, such that, 

E { e a|0| \x = x}<(l + a 2 e) e" 1 * 1 + K x . 
Then, for \9q\ > k, and appropriate K\ > 0, K > 0, 

E { e «l e il | Y, O = 9 } = B{B{e a ^ \ X x } \Y,Q = 9 } 

< E{(l + a 2 e)e a ^ +K 1 \Y,e = 9 } 

< {l + a 2 e){l-ad/2)e a \ e °\ +K 

< (1 -a6)e al0 ° l +K. 

Now since standard arguments (see for example [25]) show that compact sets 
are small for this problem, the Gibbs sampler is shown to be geometrically 
ergodic by Theorem 15.0.1 of [19]. 

2. The second result is proved almost identically. Notice that £(0 | Y = 
y, X = x) is symmetric around y — x and possesses finite moment generating 
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function in a neighbourhood of 0, thus as we showed above, for all sufficiently 
small a > 0, there exists a K\ > such that, 

E{e a|e| | Y = y, X = x} < (l + a 2 e) e^"* 1 + K x . 

Using Lemma 6.2 and arguing as in 1 proves the theorem. □ 
Before proving Theorems 3.4 and 3.6 we need the following general result 
about Markov chains on the real line. 

Theorem 6.3. Let {W n } be an ergodic and reversible with respect to 
a density tt, Markov chain on R with transition density p(x, y) which is 
random walk-like in the tails, in the sense that there is a continuous positive 
symmetric density q such that 

(8) lim p(x, x + z) = q(z), z 6 R. 

\x\— +oo 

Then 

1. tt has heavy tails, in the sense that 

(9) lim MZlM^ = lim lQg/r > ( ^ = Q; 

x^oo x x— »oo — x 

2- {Wn} is not geometrically ergodic. 

PROOF 1. We will prove the result for x —* oo, since the case x — > — oo, is 
proved in the same way. Fix z, 5 € R + , and let W denote a random variable 
which has density tt. By (8), there exists k > such that for x > k 

4^4<(i+5). 

p(x, X + z) 

This uses the fact that q(z) > 0. Thus by reversibility, and for x > k, 

gQg) p(x + z,x) 
ir{x + z) p(x,x + z) 

so that 

(10) ir(x + z) > (1 + 8y\(x) . 
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Integrating (10) over x > k, gives that 

(11) P(W > k + z) > (1 + 5)~ l P{W > k) . 

Iterating this expression, and after some algebra, we get that 

logP(W > k + nz) 
hm > — o, 

n — >oo ii 

which, since 5 can be chosen arbitrarily small, proves the statement. 

2. The second follows from the following standard capacitance argument; 
see [25] for similar arguments for MCMC algorithms and [14] for an intro- 
duction to Cheeger's inequality using capacitance. Cheeger's inequality for 
reversible Markov chains implies that geometric ergodicity must fail if we 
can find k > 0, such that the probability 

P(|Wi|<*|Wb~7r ( _ fc>fc) c) 

is arbitrarily small, where we use 7iv ^ Me to denote the density ir restricted 
and re-normalised to the set {\x\ > k}. Notice that (11) implies that for 
sufficiently large k, for \x\ > k, and any I > 0, there 

P(|Wi| >x + l\W >k)> (1 + 5)" 1 > 1-5 . 

Now choose / sufficiently large that Jj 30 q(u)du < 5 then for all \x\ > k, 

P(\W 1 \<k)<P(\W 1 \<k\W r, 7T{ _ k ^)+-p(\W 1 -Wo\>l) 

which converges as \x\ — > oo to a limit bounded by 35. Since 5 is arbitrary, 
the result is proved. □ 
PROOF OF Theorem 3.4 we prove the theorem for the case where the 
model is RID, since the proof when the model is RIP is identical. We will 
show that under the assumptions the marginal chain {G n } generated by 
the centred Gibbs sampler is random walk-like, thus by Theorem 6.3 Vq is 
not geometrically ergodic. By assumption, limigi^^ C(X\Y, = 9)= C(X), 
which is symmetric around 0, and let F denote its corresponding distribution 
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function. Therefore P(X < 9 + z | Y, 9 = 9) -> F(z), as |0| -> oo. Notice 
that, 

p(0o,0o+^) = | / 2 (|x-e -«|)dF(a: | Y, 9 = O ) = / / 2 (|u-z|)dF(u+0 o | Y, 9 = O ), 

therefore, since / 2 is bounded, p(9o, 9q + z) — ► / / 2 (|u — zl)dF('u) = <z(z), as 
|#o| ~~ > °°! where g is a symmetric density around 0. □ 
PROOF OF Theorem 3.6 Throughout the proof we shall use the following 
notation: fx and / 2 denote the density of Z\ and Z 2 respectively (at least 
up to proportionality), and we define 

fe{x) = fi{\y-x\)h{\x-0\\ 

thus, 7r(x I y,9) = fe(x)/cg, where eg is the normalisation constant. Any 
scale parameter involved in /j will be denoted by <7j, i = 1, 2. 

For each model, we first prove the result for Vo and subsequently for V\. 
We will prove the statements corresponding to the upper triangular elements 
of the Vo and V\ tables. This is without loss of generality, since we can write 
(3) as 

X = Y - 9 - Zi 

X = z 2 . 

Since the actual value of Y does not affect convergence (as can be verified 
by our proofs below), we may as well set it to be 0, and since C(Z±), £(Z 2 ) 
are symmetric around 0, the model written above under a non-centred 
parametrisation coincides with (3) under a centred parametrisation but with 
the error distributions interchanged. We first prove the results concerning 
the diagonal elements. 

The (C, C) model 

We prove the result by verifying the PTIP property. The result then fol- 
lows by Theorem 3.3. Notice that in this model, cq = fg(x)dx = 
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2 j!^ 9 ^ 2 fg(x)dx. We show that Vq is PTIP by demonstrating that for 
arbitrary k > 0, 

rv+h 

liminf / fe(x)/cg dx > . 

|6»|^00 Jy-k 

By symmetry, it is enough to prove this statement for large positive 9 values, 
so from now on we shall assume that 9 > y. 

For x < (y + 0)/2, 1 + (y - 9) 2 < 1 + 4(x - 9) 2 < 4(1 + (x - 9) 2 ), so that 
ce < 4/-7r(l + (y — #) 2 )- Moreover, notice that when x £ (y — k,y + k), then 
there exist a d > (depending on k,y), such that for all 8 > d, 

i+jj/z #) 2 > i + G/-fl) 2 > i /2 

l + (x-0) 2 " l + (y + £;-#) 2 " 7 ' 
Therefore, for 9 > d, 

v- fc /e(x)/Ce - U Ml + (y-x) 2 ) { l + { x-9) 2 ) dx 
1 /•»+* 1 

" 8 Jy-k vr(l + (y - x) 2 ) > ' 
which proves the result. The result for Pi is proved identically. 

The (E,E) model 

Without loss of generality we assume that fi(x) oc exp{— \x\}, and f2{x) oc 
exp{— |x[/(t}, g > 0. The stability of the Gibbs sampler depends on whether 
a < 1, (7 = 1 or <T > 1, thus we consider these cases separately. Again by 
symmetry it is enough to consider y < 9. 

1. o" = l: here we can write 

l e 2x-y-e^ x < y 

fe(x) = { \e-(°-y\ y< x <9 
\ e y +e - 2x , x>9. 

From this it is easy to demonstrate that E(@i\Q = 9 ) = (y + 9q)/2. 
Since all compact sets are small for the Markov chain {0 n } this is 
enough to demonstrate geometric ergodicity by Theorem 15.0.1 of [19]. 
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2. 



a > 1: here we can write: 



fe(x) 




Direct algebra shows that 



P3(0) P2(0) 



E{x-e | y e = e} = Pl (e)(Y-i)+\p 2 (e)+p 3 (eyi}e+ P2 (e)r(e)+^-- 

cr + 1 

where + P2(#) + Ps{6) = 1> and as 9 — > oo, P2(#) — (c + 

l)/(2ff),pi(e) -» O,r(0) -» 0. Therefore, 



and the model is DUR. Since V\ is easily seen to be GTIP, by part 1 
of Theorem 3.5, Vq is geometrically ergodic. 
3. a < 1: Here, in an analogous way to the above, we can demonstrate 
that Vq is RIP therefore by Theorem 3.3, Vo is uniformly ergodic. 
Due to symmetry, the results for V\ are proved in a similar fashion, 
notice however, that V\ is uniformly ergodic when a > 1. 

The model 

This is covered in [21, 24] and reviewed in Section 3.1. 
The (L,L) model 

We assume that fi(x) oc exp{— |x/cji| /3 }, f 2 (x) oc exp{— |x/cj2|^}, and we let 
a = (3/ {(3— 1). Again by symmetry we just consider the case y < 6. For large 
9, C(X\Y, = 6) converges weakly and in L 1 to a point mass at p9 + (1 — p)y 



As a result, neither nor V\ are GTIP, so it is not possible to establish 
geometric ergodicity using the DUR and PUR properties (which hold for 
this model) in conjunction with Theorem 3.5. Instead, we have to construct 



lim B{X - 6\Y,e = 9} < 



-2 



where 



P = — 



a 2 a + a 
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directly a geometric drift condition. However, this is rather easy. Notice that 
since C(Q \ X = x) is symmetric around x, we can find a b > such that 
E{|0| | X = x} < \x\ + b. Moreover, for any e > 0, there is some k > 0, such 
that for all 9\ > k, E{\X - y\ | Y = y, = 9} < (1 + e)p\0 - y\, thus 

E{|6i - y\ \& = 9 }<b + p(l + e)\9 - y\ 

which implies geometric ergodicity for Vq since compact sets can easily be 
seen to be small. The result for V\ is proved identically. 

The (C,G),(E,C) and (L,C) models 

We show that the model is RIP, therefore since Vq is PTIP, by Theorem 
3.3 Vq is uniformly ergodic, and by Theorem 3.4 V\ is not geometrically 
ergodic. Notice, however, that for any x, using dominated convergence we 
can show that cg/f2(\x — 9\) — > 1, as \9\ — > oo. The argument is that, for 
any u, f2(\u — 6\)/f2(\x — 9\) — > 1, and the ratio is bounded above (as a 
function of 9) by a function of u which is integrable with respect to fx, as 
long as fi has exponential tails or lighter, which is the case in the models 
considered here. However, since fg/cg — > fi(\y — x\), and this limit is a proper 
density, it follows that the corresponding distribution functions converge and 
C(X \Y = y,e = 9) -> £(|Zi - y\) as \9\ oo. 

The (G, E) model 

Calculations show that 

lim C(X\Y,B = 9) = N(y+af/a 2 ,crf), and lim C(X\Y,G = 9) = N(y-af/a 2 ,af), 

9— >oo 9^>— oo 

therefore Vq is PTIP (but not RIP) and by Theorem 3.3 uniformly ergodic. 
The above result, however, shows that the model is PUR, and since all 
conditions of Theorem 3.5 are satisfied, V\ is geometrically ergodic. 

The (L,E) model 

The result is proved as above. 

imsart-aos ver. 2007/01/24 file: papaspiliopoulos_roberts_stabilty-REVISION.tex date: February 5, 2008 



STABILITY OF GIBBS SAMPLER 29 

The (L, G) model 

Here (perhaps surprisingly) Vq is not PTIP but the model is DUR and PUR, 
and both Vq and V\ are GTIP so that Theorem 3.5 can be applied. 

□ 

PROOF OF Lemma 3.7 Consider the Gibbs sampler with initial value 
Xq which updates (Q,Q) first and then X. Direct calculation gives that 
C(Q | Y = y,X = x,e = 6) = Ga(l,(y - x) 2 /2), C(X \ Y = y,Q = 
9,Q = q)= N(9/(q + 1) + qy/(q + 1), l/(q + 1)), therefore C{X l -X \Y = 
U, Qi = q) = N (q(y - X Q )/ (q + 1), 1 + l/(q + 1)). However, since q — > in 
probability, when Xq — > oo, the algorithm is random walk-like in the tails 
and by Theorem 6.3 fails to be geometrically ergodic. □ 

PROOF OF Theorem 4.1 It is easy to demonstrate that the model is RID, 
lim £(X|Y, 9 = 6) = N p (0, E) . 

|6*|— >oo 

Therefore, V\ is PTIP and by Theorem 3.3 is uniformly ergodic. Since 
this implies that for the Gibbs sampler using Vq, 

lim c(e n+1 -e n \e n = e n ) = N(o,—^ T -) , 

Therefore by Theorem 6.3, geometric ergodicity fails. 
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